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Abstract 

High-dimcnsional data analysis has been an active area, and the main fo- 
cuses have been variable selection and dimension reduction. In practice, it 
occurs often that the variables are located on an unknown, lower-dimensional 
nonlinear manifold. Under this manifold assumption, one purpose of this paper 
is regression and gradient estimation on the manifold, and another is developing 
a new tool for manifold learning. To the first aim, we suggest directly reducing 
the dimensionality to the intrinsic dimension d of the manifold, and perform- 
ing the popular local linear regression (LLR) on a tangent plane estimate. An 
immediate consequence is a dramatic reduction in the computation time when 
the ambient space dimension p » d. We provide rigorous theoretical justifica- 
tion of the convergence of the proposed regression and gradient estimators by 
carefully analyzing the curvature, boundary, and non-uniform sampling effects. 
A bandwidth selector that can handle heteroscedastic errors is proposed. To 
the second aim, we analyze carefully the behavior of our regression estimator 
both in the interior and near the boundary of the manifold, and make explicit 
its relationship with manifold learning, in particular estimating the Laplace- 
Beltranii operator of the manifold. In this context, we also make clear that it 
is important to use a smaller bandwidth in the tangent plane estimation than 
in the LLR. Simulation studies and the Isomap face data example are used to 
illustrate the computational speed and estimation accuracy of our methods. 
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1 Introduction 



High-dimensional data arise frequently in many fields of the contemporary science. 
In addition, it is common that the sample size is small relative to the dimension- 
ality of the data. Such intrinsically complex data structure introduces new chal- 
lenges in statistical analysis and inference, and requires innovative methods and 
theories [13, 17]. In this context, we focus on the regression problem, which plays 
an important role in understanding the relationship between the response variable 
and the predictors. Conventionally, the probability density function (p.d.f.) of the 
predictor vector is assumed to be non-degenerate. In this case, variable selection 
and dimension reduction are fundamental issues and have been extensively studied 
[12, 14, 41, 13, 15, 23, 38, 39]. However, these problems remain difficult in the non- 
parametric regression setting, because commonly the models are built in the ambient 
space and the curse of dimensionality is a serious issue [20, 10, 44]. 

Recently, it has been noticed that, in practice, the predictor vector often takes on 
values in a lower-dimensional, nonlinear manifold. More specifically, in the cryo Elec- 
tron Microscopy problem [16], the images are located on the 3-dimensional manifold 
5*0(3); in the radar signal example the data can be modeled as being sampled from 
the Grassmannian manifold [6]; natural images are argued to be lying on a Klein bot- 
tle [4]; the general manifold model for image and signal analysis is considered in [31]; 
and spherical, circular and oriental data are distributed on special types of manifolds 
[25]; to name but a few. Based on the manifold assumption, in the past few years, 
numerous papers have been devoted to learning the manifold, or more generally the 
underlying structure [7, 21, 36], and a few have addressed regression on manifolds 
[30, 3, 1]. 

In the manifold learning literature, the Nadaraya- Watson kernel regression esti- 
mator has been used to construct an estimator of the Laplace-Beltrami operator of 
the manifold; however, to avoid the boundary blowup problem, Neuman's boundary 
condition is required [7]. When the p-dimensional predictor is non-degenerate in MP, 
it is well known that the asymptotic bias of the traditional LLR in the Euclidean 
setup is related to the Laplacian of the regression function and that it alleviates the 
boundary effect [34]. Thus, it is interesting to see if these properties still hold for 
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some properly constructed LLR in the manifold setup, as it will enable us to obtain 
a new estimator for the Laplace-Beltrami operator of the manifold with a different 
boundary condition. 

Besides, due to the rich geometric structure, when the predictors are concentrated 
on a manifold, regression models that taking into account the geometric structure of 
the manifold are intuitively appealing. In [30, 24] the kernel regression estimator 
is constructed directly on the manifold, using the true geodesic distance both in 
determining the nearest neighbors and in constructing the kernel weights. Another 
approach is to employ the usual LLR in the ambient space MP with regularization 
imposed on the coefficients in the directions perpendicular to a tangent plane estimate 
[1]. However, there are several interesting and important issues left unsolved. First, 
although the idea of constructing kernel estimators on the manifold in [30, 24] is 
appealing, it is unrealistic to make use of the geodesic distance. It is non-trivial to 
construct LLR on the manifold without knowing the manifold structure. Second, it 
remains unknown whether the methods in [1] alleviate the boundary effect, and it is 
not obvious whether the asymptotic biases have any connections with the Laplace- 
Beltrami operator of the manifold. Third, when p is large, fitting LLR in MP as in [1] 
can be computationally expensive even if regularization has been imposed. Fourth, 
in [1] the bandwidth used in the tangent plane estimation is the same as the one 
employed in the LLR. It is unclear if we can benefit from using different bandwidths 
in these two steps. Fifth, the quantity "exterior derivative rf^/Uo" in [1; (4-5)] is subtle 
and the details are missing. Furthermore, the topology of the embedded manifold, 
in particular, the condition number [29], is another important issue that needs to be 
taken care of. 

Motivated by the above observations, in this paper, we explore further the Rieman- 
nian geometric structure of the manifold, in particular the tangent bundle structure, 
and construct the LLR directly on an estimate of the tangent plane to the manifold, 
without knowing the geodesic distance and manifold structure. Specifically, we first 
estimate the intrinsic dimension d, and deal with the condition number issue when 
determining the nearest neighbors using the Euclidean distance. Subsequently, we ob- 
tain an estimate of the embedded tangent plane based on local principal component 
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analysis (PCA). Finally, we construct the LLR on the tangent plane estimate using 
the coordinates of the nearest neighbors with respect to the orthonormal basis. We 
call our approach the Manifold Adaptive Local Linear Estimator for the Regression 
(MALLER). In addition, we suggest a procedure for selecting the bandwidth in the 
regression step that can handle heteroscedastic errors, which arise often in practice. 
A consequence of the proposed MALLER is an estimator for the gradient and the 
Laplace-Beltrami operator of the manifold. 

Throughout this paper the dimension p is kept as a fixed number and we assume 
the predictors are observed without any noise. Thus, if the sample size n is large 
enough compared to the intrinsic dimension d, the tangent plane can be estimated 
accurately so that the dimensionality of the data can be reduced from p to d. Under 
this circumstance, the first consequence is a much more computationally efficient 
scheme when p is large and p ^ d, since all the computations in the regression 
step depend only on d. Another consequence is the ability to handle the practical 
situations where n is less than p, in which case no sparsity conditions like those in 
[1] are needed for MALLER to work. The isomap face data analysis illustrates these 
points. 

We provide detailed theoretical justification of the convergence of MALLER by 
carefully analyzing the curvature, non-uniform sampling and boundary effects. In par- 
ticular, the MALLER and gradient estimators achieve the respective optimal rates of 
convergence pertaining to nonparametric regression on (i-dimensional manifolds. In 
addition, the subtle relationship between the bandwidth used in the tangent plane 
estimation and the one used in the LLR is made explicit: it is crucial that the former 
should be of a smaller order than the latter, otherwise larger biases are introduced 
in the LLR on the tangent plane estimate and in the Laplace-Beltrami estimator 
mentioned below. This issue is particularly important when estimating the Laplace- 
Beltrami operator. Moreover, MALLER enjoys both the automatic boundary correc- 
tion and the design adaptive properties possessed by the LLR in the M'^ setup [34]. 
These properties have strong implications in manifold learning. In particular, if the 
manifold has a smooth boundary, the Laplace-Beltrami operator estimated by our 
method MALLER is different from the one estimated by employing the Nadaraya- 



4 



Watson kernel method, in the sense that the two are under different boundary condi- 
tions. Since the main focus of this paper is regression on manifolds, further theoretical 
properties and applications of the new estimator of the Laplace-Beltrami operator are 
left as a future work. 

The rest of this paper is organized as follows. The proposed MALLER algorithm 
and a bandwidth selection procedure are introduced in Sections 2 and 3 respectively. 
Asymptotic results for the conditional mean squared errors of MALLER and the gra- 
dient estimator in both the interior and boundary of the manifold are given in Section 
4. In Section 5 we examine finite sample performance of MALLER and compare it 
with those of [1] through one simulation study and application to the isomap face 
dataset, and we demonstrate the efficacy of our gradient estimator via a simulated 
example. Section 6 gives a brief introduction of the diffusion map framework and 
discusses application of MALLER to estimating the Laplace-Beltrami operator of the 
manifold. In Section 7, besides addressing the relationship between MALLER and 
the NEDE algorithm in [1, (4.6)], we discuss various related open questions and fu- 
ture directions in both regression on manifolds and manifold learning. Proofs of the 
theoretical results can be found in the Supplementary, which also contains a brief in- 
troduction to the exterior derivative, covariant derivative and gradient of a function 
on the manifold. 

2 Model and Estimation Procedure 

Let Y denote the scalar response variable and let X be a p-dimensional random vec- 
tor. Assume that the distribution of X is concentrated on a cZ-dimensional compact, 
smooth Riemannian manifold M embedded in W via t : M M^, where M may have 
boundary. We consider the following regression model 

Y = m{L-\X)) + a{L-\X))e, (2.1) 

where e is a random error independent of X with E(e) = and Var(e) = 1, and both 
the regression function m and the conditional variance function o"^ are defined on M. 

Let {(X/, y/)}JL]^ denote a random sample observed from model (2.1) with X : = 
{^«}r=i being sampled from X. Then, given x G M, the problem is to estimate 

5 



nonparametrically m{x), and its higher order covariant derivatives at x if m is smooth 
enough, based on {{Xi, Yj)}"^]^. Here, x may or may not belong to X. For the sake of 
clearness, we should distinguish between the point x G t(M) and the point l^^{x) G M. 
However, to simplify the notation, for the rest of this paper we use the same symbol 
X to denote x G ^(M) or r^{x) G M and use X to denote X G 6(M) or r-^iX) G M 
unless there is any ambiguity in the context. In addition, throughout this paper we 
assume that the sample size n ^ d and X is not contaminated by error. In the 
following subsections we discuss the steps in the MALLER algorithm : (1) estimating 
the intrinsic dimension d of the manifold, (2) determining the true nearest neighbors 
of X on M using the Euclidean distance, (3) estimating the embedded tangent plane 
by local PCA, and (4) constructing LLR on the embedded tangent plane estimate. 
Before going into the details, the MALLER algorithm is summarized below. 
The MALLER Algorithm: 

1. Calculate the MLE intrinsic dimension estimate d in [22], and treat it as d. 

2. For the given x, h^ca and h determine A/^^^^^ and AQj'^'^, the two sets of estimates 
of the true nearest neighbors of x on M within a Euclidean ball of radius ^Jh^ca. 
and y/h respectively, which are defined by (2.2). 

3. Employ the local PCA based on the points in A/'^'^p^^ to get an orthonormal 
basis {Uk{x)}'l^^ for the embedded tangent plane estimate at x, thus obtaining 
{^Ci}JLi5 the coordinates of the projections of {Xi — x}"^^ onto the affine space 
spanned by {Uk{x)}'l^^ with respect to this basis. See Section 2.3 for the details. 

4. For given kernel K and bandwidth obtain (3^ by the LLR (2.4) based on 
{aj; : G AQ'^ft^'^}. Then we can compute the regression, embedded gradient and 
covariant derivative estimators defined in (2.9), (2.10) and (2.11) respectively. 

2.1 Intrinsic dimension estimation 

Given the manifold assumption, in general the intrinsic dimension d of the manifold 
M is unknown a priori and needs to be estimated based on the sample X. There exist 
many methods for estimating the intrinsic dimension and we have picked the maxi- 
mum likelihood estimation (MLE) method introduced in [22] to estimate d and denote 
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the estimated dimension by d. Since d <^n, we assume the estimated dimension d is 
correct and hence will not distinguish between d and d. 

2.2 Determining the nearest neighbors 

Numerically determining the neighbors of x G M using the Euclidean distance is 
problematic due to the embedding structure of the manifold, that is, the condition 
number of the embedded manifold [29]. The reach of M is defined as the largest 
number r > so that for every < r < r, the open normal bundle of M of radius 
r is still embedded in W. Since M is assumed to be compact, we know r > 0. The 
quantity 1/r is referred to as the "condition number" of M [29] . For the given s G M 
and any 5 > 0, denote respectively the set of Euclidean v^-neighbors of x from X 
and the set of geodesic -\A-neighbors of x from X as 

^^^^ = {Xj G X : \\Xj - x\\rp < V6} and A^J = {Xj G X : d{Xj,x) < Vd}, 

where d{-, ■) is the geodesic distance. When 6 is small enough, it is shown in Lemma 
A. 2. 4 in the Supplementary that Af^g is roughly the same as Af^^, which is the main 
fact rendering the whole algorithm feasible. However, when \/5 exceeds 2r, A/'J^ 
might be a strict subset of A/'jf^. See Figure 1. This fact combined with the lack of a 
priori knowledge of M, in particular, the geodesic distance and the condition number 
1/r, lead to the problem. Since the manifold structure is our main concern, we need 
to learn Af^^. The problem is thus reduced to determining which points in Af^l are in 
Af^g and which are not. To cope with this problem, we apply the "self-tuning spectral 
clustering" algorithm [40] to the set A/^f^. We denote 

J\f^J'' ■= {Xj G A/'5 : Xj is in the same cluster as x}. (2.2) 

Then, according to Lemma A. 2. 4 in the Supplementary, A/"^™^ is an accurate estimate 
oiAf^s- 

2.3 Embedded tangent plane estimation 

Write the tangent plane of the manifold at a; G M as T^M. Denote by l^: the total 
differential of l and by i^Tj-M the embedded tangent plane in W. Note that i^T^M is 

7 



Figure 1: Condition number. A 1-dim manifold M (blue curve) is embedded in W with 
the condition number 1/r. For the fixed x G M, the black circle is of radius and is 
centered at x. The Euclidean v^-neighbors of x, AAjf^, consists of both the red and green 
crosses. However, the geodesic v^-neighbors (true neighbors) of x, .Af^g, consists of only 
the red crosses but not the green crosses. 

a (i- dimensional affine space inside W which is tangential to M at x. Next, we find 
an orthonormal basis of an approximation to the embedded tangent plane i^T^M. 
Fix hpca. > 0. Assume that there are points in AQ^^^^^^ and rewrite them as 

■AC:Ca = {^^i'---'^^^J- Let 

1 T 
1=1 

be the sample covariance matrix of ■A/'*''^^^^, where fix is the sample mean of J^x^h^cs.- 
Denote by {Uk{x)}f.^-j^ the eigenvectors corresponding to the d largest eigenvalues of 
J^x, where Uk{x) is a p x 1 unit length column vector and d is the dimension of the 
manifold M, and define a p x d matrix 

Bx:=[U,{x) ... U,{x).] (2.3) 
Let xi = {xi^i, xi^df ■= Bl{Xi - x), for / = 1, . . . , n. 

2.4 Local linear regression on the tangent plane 

Choose a kernel function ii" : [0, oo] — > M so that -ft'lp,!] G (^""^([0, 1]) and -ft'|(i,oo] = 
and a bandwidth h > 0. Notice that h is different from hpca- We solve the regression 
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problem (2.1) at x via considering the following local linear least squares fitting on 
the estimated tangent plane: 

n d 2 

= argmin V f - /3o - V l3kXi,k) lxtru.{Xi)Kh{Xi, x), (2.4) 

where /3 = (/3o, A, • • • , /3d)^, Kn{Xi,x) := h-'/^KdlXi - x\kp/Vh), and I is the 
indicator function. Denote 

Y = {Yi,...,Y^f and m = {mir^Xi)), . . . ,m{r\Xn))f . (2.5) 

Denote by X^,. the n x {d + 1) design matrix related to x: 



1 ... 1 



(2.6) 



. . . Xfi 

and the kernel weight matrix: 

= diag (^Kh{Xi, x) Vt.uo(Xi), . . . , Kh{Xn, x)I^t.uc(X„)) , (2.7) 
which is a diagonal matrix of size n x n. Then (2.4) can be written as 



= argmin(l^ - X,/3)^W,(Y - X,/3). (2.8) 
It is straightforward to show that the minimizer in (2.8) is 

if (X^WajX^;)"^ exists. The invertibility of X^W^^X^; will be shown in the Supplemen- 
tary. Our estimator of m{x) MALLER is given by 

mix, h) ■= vl^P, = (X^W,X,)-iX^W,r, (2.9) 

where Vk G W^'^^ is a {d+1) x 1 unit vector with the k-ih entry being 1. If the interest 
is to estimate the embedded gradient of m at x, the following estimator is considered: 

d 

i^gradm(x) := ^ Va^(^)m(x, /i)f/i(x). (2.10) 

i=l 

where grad denotes the gradient, 



Va,(^)m(x, h) := (2.11) 
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and {di{x)}f^i is the orthonormal basis of T^M closest to the estimated orthonormal 
basis {f/fe(a^)}fc=i in the sense described in Lemma A. 2. 6 in the Supplementary. We 
mention that the gradient on the manifold is closely related to the covariant derivative 
and the exterior derivative. The relationship between these quantities is summarized 
in the Supplementary. 

From (2.6) and (2.8) we can see that the key ingredient in the estimators (2.9), 
(2.10) and (2.11) is finding the coordinate of a given point related to a chosen basis and 
approximate locally the regression function by a linear function of that coordinate. A 
consequence of this fact is dimension reduction. Indeed, since d may be much smaller 
than p, having obtained {a;/}[L^, locally at x we convert the p-dimensional regression 
problem to a (i-dimensional one, by paying the price of additional sampling error 
coming from the tangent plane approximation and the curvature of the manifold. 
Nonetheless, it is shown in Section 4 and Section 5 that the effect of this extra 
sampling error on the MALLER is negligible and does not contribute to the leading 
term in the estimation error, provided that hpca, is smaller than h. 

3 Bandwidth Selection 

Selection of the local PCA bandwidth hp^s. is a less important problem than choosing 
the bandwidth h in the regression step, as it is discussed in Section 4 that /ipca should 
be smaller than h and of a smaller order than the optimal order of h. We refer to [36] 
for selection of /ipca- Suppose that for a given choice of /ipca? the tangent plane estimate 
has been obtained. The aim is finding the optimal value of h so as to minimize the 
asymptotic conditional MSE of the MALLER, which is provided in (4.5). When the 
random errors are homoscedastic, the modified generalized cross-validation (mGCV) 
suggested in [3] can be used. Specifically, let 'Hmccv = {Ai,...,Ab} be a set of 
candidate bandwidths, where Xi > 0, i = 1, . . . , B, and B E N, and for each point 
X we choose a block of data points {{Xj,Yj)}j^j. For each h G HmGCV, define the 
mGCV of h by 

mGCV(/i) = (^1 + 2atrj(/i)) — ^ (^Fj -m(Xj,/i))^ 
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where ati j{h) := ^ ^jej '"i i^Xj^ Xj^Xj)~^Vih~^/'^ K (0) , rii is the number of points 
in J', and m{Xj,h) is the MALLER (2.9) of m{Xj) based on bandwidth h. Then 
^mGCV,m IS chosen as the value of h in "Hmccv which minimizes mGCV(/i). 

In the presence of heteroscedastic random errors, we adopt the following additional 
step to deal with the bandwidth selection problem. Note that the optimal bandwidth 
has to balance between the conditional bias and the conditional variance, which de- 
pends on cr^(x). Thus, with the pilot mGCV bandwidth /imGCV.m we get the first 
estimate of m{Xi) by the MALLER, denoted as rh{Xi, /imccv,™), ^ = I, ■ ■ ■ ,n, and 
we apply the method suggested in [5] to estimate (T^(x). We choose this method since 
the random error e might have a heavy tailed distribution. Defining the residuals as 

ri := (Yi - rh{Xi, /imGCV,m)) , / = 1, • • • ,n, 

we evaluate the following minimization problem 

{ao{x),a{x)) = argmin {log{fi+l/n)-ao-a'^B'^{Xi-x)YKh^^^^^y,{Xi,x), 

' ^>'»mGCV,f 

where /imGCV.f is the bandwidth determined by minimizing the mGCV upon the data 
set {{Xi,\og{ri + 1/Ti))}"=i- The estimated value of ct^(x) is then defined as 

n -1 

a^(x) := e'^«(^) 



1 " 

n ^-^ 

1=1 



Finally we select the bandwidth for MALLER given in (2.9) at x G M. Denote the op- 
timal bandwidth at x as hopt{x). Fix a candidate bandwidths set "Hopt = {^1, • • • 7 ^b}, 
which may be different from "Hmccv; where B E N and Aj > 0, i = 1, . . . ,B. For 
each h G "Hopt, estimate the conditional bias and the conditional variance of rh{x,h) 
respectively by 

b{x, h) = 2[rh{x, h) — rh{x, h/2)], 

which is based on the asymptotic bias expression given in (A. 30) of the Supplementary 
and (4.10), and 

v{x, h) = i;f(X^W,.X,.)-'X^W,©,W,.X,.(X^W,.X,.)~'i^i, 

which is based on the finite sample variance expression given in (A.31) of the Supple- 
mentary, where is a nxn diagonal matrix = diag{(3"^(Xi), . . . ,o"^(X„)}. The 
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conditional MSE of m{x, h) is then estimated by 

MSE(x, h) := b{x, hf + v{x, h). 

The value of h E Tiopt, denoted as hopt{x), which minimizes MSE{x,h) is then used 
to approximate hopt{x). With hopt{x), we can evaluate rh{x, hopt{x)). We do not 
claim the optimality of the bandwidth selection in this algorithm. For example, when 
the point x is near the boundary of the manifold, the bandwidth should be chosen 
differently. We choose this bandwidth selection scheme since it is commonly used and 
is easy to implement [33, 11]. Further study on the bandwidth selection problem in 
the manifold setup is an important and open problem and is out of the scope of this 
paper. 

4 Theory 

Before stating the main theorems describing the behaviors of the proposed MALLER 
given in Section 2, we set up more notation. Recall the assumption in Section 2 that 
M is a (i-dimensional compact smooth Riemannian manifold embedded in MJ' via t. 
Let the metric (yf on M be the one induced from the canonical metric of the ambient 
space W. The exponential map at a; G M is denoted as exp^,. Denote by d{x, y) the 
distance between x, y G M. The volume form on M induced from g is denoted as dV . 
Given 5 > 0, denote the set of points close to the boundary with distance less 
than 5 as 

= {x e M : min < 5}. (4.1) 

When 5 > is small enough, we denote the geodesic ball with radius 5 and center 
X G M as BY{x). Denote Bf{x) as the ball in M"^, g G N, with radius 6 and center 
X G M'' and S'^^^ as the standard q — 1 sphere embedded in M'^ with the induced 
metric. Define 

5f (x) := {Bf{x) n 6(M)) C M, (4.2) 

which is an approximate of the geodesic ball Bf^{x). Denote by V the Levi-Civita 
connection, A the Laplace-Beltrami operator and Hess the Hessian operator of (M, g). 
Denote by Ric the Ricci curvature of {M,g). The second fundamental form of the 
embedding t at x is denoted by Ha;. 
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4.1 Assumptions 

Let the random vector X : i7 — t- be a measurable function with respect to the 
probabihty space {Q, J-", P). To make the definition clear, in this paragraph we make 
clear the role of l to distinguish between x E M and l{x) G l{M). Suppose the range of 
X is supported on l{M). In this case, the p.d.f. of X is not well-defined as a function 
on MP if the intrinsic dimension d of M is less than p. To define properly the p.d.f. of 
X, let B be the Borel sigma algebra of t(M), and denote by Px the probability measure 
of X, defined on B, induced from P. Assume that Px is absolutely continuous with 
respect to the volume measure on 6(M), that is, dPxix) = f{r^{x))iAV{x), where 
/ G C^(M). Thus, for an integrable function ( : l{M) — )■ M, we have 



where the second equality follows from the fact that Px is the induced probability 
measure, and the last one comes from the change of variable x = L{y). In this sense 
we interpret / as the p.d.f. of X on M. 

The kernel function K : [0, oo] — t- M used in the proposed MALLER is assumed 
to be compactly supported in [0, 1] so that -ft'lp,!] G (^""^([0, 1]). Denote 



and we normalize K so that jj^i^ = 1. Note that we can also consider more general 
kernel functions. For example, any C^(]R) function with proper decaying property 
can be chosen. More general bandwidth like a positive definite symmetric bandwidth 
matrix H considered in [34] can also be considered. Since the analysis under these 
more general conditions is the same except for the wrinkle caused by the extra error 
terms, we focus on the above setup to make the analysis clear. 
We make the following assumptions in the analysis. 

(Al) /i — )• and nh'^^'^ — )• oo as ri — > oo. 



EC(X) = / aX{co))dPiu) = [ ax)dPx{x) 




[ ax)fir\x))iAVix) = [ my))fiy)dViy), 
Jm Jm 



(4.3) 




(A2) / belongs to ^2 



(M) and satisfies 



< inf /(x) < sup/(x) < oo. 



(4.4) 
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(A3) For every given h > and every point x G M^^, the set B^{x) fl M contains a 
non-empty interior set. The purpose of this assumption is to avoid the potential 
degeneracy near the boundary. 

(A4) Assume that hlil, < min(2r, inj(M)) and h^^'^ < min(2r, inj(M)), where inj(M) 
is the injectivity radius of M and 1/r is the condition number of M [29]. Please 
see step 2 of the algorithm for precise definition of r. 



4.2 Main Theory 

We state our main theorems here and postpone the proofs to the Supplementary. 
Theorem 4.1. Suppose hpca x 72"^/'^'^+^) and h > hpca- When x G M\M^, the 



conditional mean square error (MSB) of the estimator m{x, h) is 
MSE{rh{xMX} = h-'^iAmix))' + ^ ^^'^'''^"'^ 



+ 0{h' + h'hlit) + 0J^ 



+ 



1 



(4.5) 



-2 ' ' ^3/2/^3d/4^ 

Next, we consider the case when x is close to the boundary. To ease the notation, 
for X G and h > 0, define a (rf + 1) x (d + 1) matrix z/j^a,: 



,x,12 '^i,a;,22 



Vh 



/4.S)(.) K\\\u\\)udu /^^^^, K\\\u\\)uu^du 



(4.6) 



where for i = 1,2, Ui^x,!! ^ ^i,x,i2 is a 1 x d matrix, ^1^^,22 is a d x d matrix and 

D(a;) := exp;^(5^(a;) n M) C T,M. (4.7) 

We also define 



C :-- 



1 
L hhd 

Here, denotes the k x k identity matrix for any A; G N. 

Theorem 4.2. Suppose x G M^, hpca 
MSB of the estimator rh{x, h) is 



(4.8) 



n Qfi(i fl > hpca- The conditional 



MSE{fk(x, h)\X} = llHHesM.).,,^:n)f ^ vJu^>^p>^aHx) ^^^^ 



4 



l,x,ll 



nhi 



fix) 



+0 ( h^'^h^'^ + /^^/2/,2^ ,q( \ , L_ 

^'^Vy'-pca'l^ ^ "'pea"' J ^ '^P\^l/2fid/A-2 ^ nh'^/'^''^ 



+ 



/2 ' ^3/2/^3d/4 
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Notice that in both Theorem 4.1 and 4.2, the minimum of the conditional MSE 
is achieved when h x n~'^^^'^~^^\ which is strictly larger than /ipca- 

Corollary 4.1. Suppose dM is smooth, x G M^, hpca x ?7,~2/(c(+i) > hp^a- 

Then the conditional bias of 7h{x, h) is asymptotically a linear combination of the 
second order covariant derivative of m: 

(4.10) 

where {d}^^^^ is a normal coordinate determined in Lemma A. 2. 6 of the Supplemen- 
tary and Ck{x) is uniformly bounded for all k = 1, . . . ,d. 

Recall that when the p.d.f. of the random vector X is well-defined on W, de- 
noted as /, so that supp/ satisfies some weak conditions, it is shown in [34] that the 
conventional LLR is unbiased up to the second order term even when x is close to 
the boundary. Additionally, the LLR is design adaptive, that is, the asymptotic bias 
does not depend on /. These properties render the LLR popular in applications. In 
the degenerate case i.e. X lies on the manifold M, we can see from the proofs of 
Theorem 4.1 and Theorem 4.2 that MALLER also processes these nice properties. 
There properties of MALLER have important implications from the manifold learning 
viewpoint, which will be discussion in Section 6. 

4.3 Gradient and Covariant Derivative Estimate 

When the p.d.f. / of X is non-degenerate on W, it is well known that the traditional 
LLR provides an estimate of the gradient of m [34, 11]. In the manifold setup, the 
notion of differentiation is generalized naturally to the "covariant derivative", and 
hence the gradient if the manifold is Riemannian. A brief introduction of the notion 
of covariant derivative, gradient, exterior derivative and their relationship is provided 
in the Supplementary A.l. In this subsection, we show that MALLER provides an 
estimate of the covariant derivative of m. 

Theorem 4.3. Suppose x E M\M^, hpca ^ ■ri~2/('i+i) ^^^^ > hpca- The conditional 
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MSE for the estimator V Si{x)''T^{x , h) given in (2.11) is 



MSE{V^m{xM^} = h^ 

1 dfX2,2Cr'^ix)f{x) 



Aii,2 VoJ(x) ^ , , fii.2d 9^Hessm{x)99Vef{x)d9 
Am[x) — 



d f{x) 



where {di{x)}f^i is an orthonormal basis of T^M described in Lemma A. 2. 6 of the 
Supplementary. 

Theorem 4.4. Suppose x G M^, hp^a x 77,-2/(^+1) ^^^^ ^ y hp^a- The conditional 
MSE for the estimator V d.(^x)n^ix , h) given in (2.11) is 



MSE{Vd^^x)m{xMX} = h 



nh 



0Jh-2Mca + hhlca] +0, 



KdlwIDw Hessm{x)u 
1 



1 

u 



du 



P\ l,d 3 



+ 



n/i 



r + 



where {di{x)}f^i is an orthonormal basis of T^M described in Lemma A. 2. 6 of the 
Supplementary. 

Based on Theorem 4.3, 4.4 and Section A.l of the Supplementary, we know that 
the estimator (2.10) indeed can be used to estimate the embedded gradient of m. 
Since the application of the estimate of the gradient is not the focus of this paper, we 
refer the readers to [7, 26]. 



5 Numerical Examples 

To demonstrate the applicability of the proposed algorithm MALLER, we test it on 
a series of simulations and a real dataset and compared it with the nonparametric ex- 
terior derivative estimator (NEDE), nonparametric adaptive lasso exterior derivative 
estimator (NALEDE), nonparametric exterior derivative estimator for the "large p, 
small n" (NEDEP) and nonparametric adaptive lasso exterior derivative estimator for 
the "large p, small n" (NALEDEP) proposed in [1], for which the codes are provided 
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by the authors of [1]*. The code for implementation of MALLER is in the authors' 
homepage^. 

All the observed values of the predictors in both the training dataset and the 
testing dataset are normalized by := {xi — fl)/s, where fi is the sample mean of 
{xi}]Li, I = + 10 and s = maxjj=i^...^„ — In order to facilitate 

the notation we write xi instead of x^ in the sequel. In step 1 of our algorithm, we 
used the MLE dimension estimation code provided by the authors of [22]^ to evaluate 
the intrinsic dimension of the manifold. In step 2, we used the code provided by the 
authors of [40]^. In step 3, we chose /ipca = 0.015. In the bandwidth selection step, for 
each regressant, we worked out the bandwidth selection procedure given in Section 
3 on 21 logarithmically equi-spaced candidate bandwidths in the interval [0.01,0.1] 
when d = 1 and [0.01, hd] when d > 1, where 

This choice of hd is motivated by the following facts. Fix d > 1. The volume of 

d+l 

S'^ is |S''^| = l^d+i. , where T is the Gamma function, and the volume of a geodesic 
ball of radius < 6{d) ^ 1 centered at x G S'^, denoted as B^^^Jx), is approxi- 



mately ^^"^^ — ^ = '^'^dr{d/2) ■ Thus, the ratio of the volume of B^^^-^{x) to |5''^| is 
r{d, 5{d)) = ^^'^^^d^d/^2)'^'' ■ Suppose S{d) = 5 ^ 1 for all d, then r(d, 5) gets smaller as 
d increases. That is, if the number of data points sampled from S'^ is the same and 6{d) 
is fixed for all d, the number of data points located in B^^^-^ (x) decreases to zero expo- 
nentially. This fact plays a role in the numerics, especially in the bandwidth selection 
problem, since in practice the number of neighboring points is not controllable. We 



thus choose the largest bandwidth hd by solving (^v^^^r(^(^+^^)/^) = r(l,0.1) = 
which leads to (5.1). We emphasize the non-optimality of this scheme to set the 
candidate bandwidths for general manifolds of dimension d, which is out of the scope 
of this paper. The kernel function K used in step 4 of our MALLER algorithm was 
taken as K{u) = exp(— 7m^)I[o,i](m). 

*littp : / / www . eecs . berkeley . edu/~aaswani/EDE_Code . zip 
^http : / / www . math . princeton . edu/~hauwu/ regression . zip 
■fhttp : //www. Stat . Isa.umich. edu/~elevina/inledim.m 

^http : //www. vision. caltech. edu/lihi/Demos/Self TuningClustering.html 
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In Sections 5.1 - 5.2 we report the root average square estimation error (RASE) 
to measure the accuracy of different estimators: 



where m{xi) is the resuh of each estimator. 

We ran our simulations and data analysis on a computer having 96GB of ram, 
two Intel Xeon X5570 CPUs, each with four cores running at 2.93GHz. No parallel 
computation was implemented. 

5.1 Simulated data: regression on the Klein bottle 

Consider the 2-dimensional closed and smooth manifold, the Klein bottle, embedded 
in M^, which is parametrized by 0Kiein '■ [0, 27r) x [0, 2tt) so that 

{u, v) "^l^'" ((2 cos w + 1) cosu, (2 cos w + 1) sinw, 2sint> cos(m/2), 2 sinw sin(u/2)) . 

We sampled n = 1500 or 1000 points uniformly from [0, 27r) x [0,27r), denoted as 
{{Ui, V5)}"^i, and then obtained the corresponding n observations {X;}"^^ on the pre- 
dictors X by the parametrization 0Kiein- Notice that the uniform sampling design on 
[0, 2tt) X [0, 2tt) corresponds to a non-uniform sampling design on the Klein bottle. To 
generate the responses corresponding to {Xi}"^^^, note that the mapping 0Kiein 

is 1-1 and onto, so any {u,v) in [0, 27r) x [0, 27r) can be written as {u,v) = 4>iQcmi^) 
for some x in the embedded Klein bottle. So, consider the following regression model 
on the Klein bottle: 



e ~ A/'(0, 1) is independent of X, and ctq is the noise level (in Y) which determines 
the signal-to-noise ratio 



n+lO 




Y :=m(X) + a(X) e. 



where 



m(X) := 7sin(4f/) + 5cos(2l^)2 + 6exp{-32((f/-7r)2 + (\/-7r)2)} 
a{X) := (To(l + 0.1 cos(f/) + 0.1 sin(V)), 



snrdb := 10 log^g 




18 



Furthermore, let 

W = X + axV, 

where ax > 0, and rj is a bivariate normal random vector with zero mean and identity 
covariance matrix, independent of X and e. Consider estimating m(X) based on 
observations on {W, Y). In this case, W = X and X is observed without error when 
(Jx = 0, and W is X contaminated with error when ax > 0. In the simulations, 
we took (Tx = or 0.2 and snrdb = 5 or 2. For each simulated sample, we drew 
n observations {(Wi,Yi)}^^^ to form the training dataset. Then, independent of the 
training sample, we sampled randomly 10 points {W^j}"=t!+i as the regressants and 
tried to estimate the values of m at {Xn+j}]'Li based on {(PVj, Fj)}"^]^. 

We evaluated the performance of each estimator by computing the average and 
standard deviation of its RASE's over 200 realizations. The estimated dimension by 
the MLE intrinsic dimension estimator was 2 for all of the 200 realizatioins, as is 
expected. The results of all the estimators and their computation time are listed in 
Table 1 and Table 2, from which we can draw the following conclusions. When there 
is no error-in-variable, i.e. ax = 0, MALLER outperforms the four competitors in 
all of the cases, with significantly smaller RASE average and similar RASE standard 
deviation. Also, the MALLER performs well when there exists error in the predictors. 
The fact that the computation time for MALLER is longer than that for the other four 
estimators can be explained as follows. Besides the sample size n, the computation 
time for the estimators in [1] also depend on the ambient space dimension p which 
is 4 in this example. On the other hand, in addition to n, the computation time 
for MALLER also depends on the estimated intrinsic dimension d which is 2 in this 
example. This fundamental difference between MALLER and those in [1] will become 
apparent when p increases and p ^ d, as in the Isomap face example discussed in 
Section 5.2. 

5.2 Real data: Isomap face data 

We further tested our algorithm on the Isomap face dataset [37]^. The dataset consists 
of 698 64 X 64 images, denoted as {/f^}f^^, parametrized by three variables: the 

^http : / / isomap . Stanford. edu/datasets .html 
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Kk'iii lx)l(l(\ ax = 0. RASE. 




n = 1500 


n = 


1000 




snrdb = 5 


snrdb = 2 


snrdb = 5 


snrdb = 2 


MALLER 


1.8675 ±0.5222 


2.3818 ± 0.666 


2.3255 ± 0.5999 


2.7454 ±0.9151 


NEDE 


2.552 ± 0.5581 


2.9382 ± 0.631 


3.4209 ± 0.6535 


3.6469 ± 0.6793 


NALEDE 


2.5519 ±0.5581 


2.9417 ±0.6331 


3.4288 ±0.6522 


3.6523 ± 0.6798 


NEDEP 


2.5514 ±0.558 


2.9371 ±0.6313 


3.4212 ±0.6534 


3.6469 ± 0.6787 


NALEDEP 


2.5511 ±0.5583 


2.9406 ± 0.6335 


3.429 ± 0.6524 


3.6528 ± 0.6791 




Klein bottle, the computation time. 


MALLER 


76.9222 ± 29.0305 


68.114 ±22.3079 


32.9121 ± 10.191 


32.7163 ± 11.3034 


NEDE 


6.0438 ±0.1573 


6.0416 ±0.1709 


5.569 ±0.1514 


5.5878 ±0.152 


NALEDE 


11.6054 ±0.289 


11.5148 ±0.2853 


10.5719 ± 0.266 


10.5617 ±0.265 


NEDEP 


11.4768 ± 0.2978 


11.4656 ± 0.3199 


10.5246 ± 0.2875 


10.5576 ± 0.2896 


NALEDEP 


17.1086 ± 0.4276 


17.0057 ±0.4317 


15.5967 ±0.4015 


15.601 ± 0.4025 



Table 1: Regression on the Klein bottle without error in the predictors. The averages and 
standard deviations, over 200 realizations, of RASE and the computation time (in seconds) 
for different estimators tested on different configurations. 





Klein bottle, ax = 0.2, RASE. 




n — 


1500 


n — 


1000 




snrdb = 5 


snrdb = 2 


snrdb = 5 


snrdb = 2 


MALLER 


3.9227 ±0.6898 


4.02 ± 0.7214 


3.9514 ± 0.6785 


4.0512 ± 0.6932 


NEDE 


3.9754 ±0.6508 


4.1225 ±0.6255 


4.1697 ±0.6599 


4.2845 ± 0.6483 


NALEDE 


3.9759 ± 0.6509 


4.131 ±0.6252 


4.1702 ±0.6612 


4.2848 ± 0.6494 


NEDEP 


3.9759 ± 0.652 


4.122 ±0.6264 


4.1708 ±0.6601 


4.2848 ± 0.6479 


NALEDEP 


3.9767 ±0.6518 


4.1227 ±0.626 


4.171 ±0.6619 


4.2851 ± 0.6492 



Table 2: Regression on the Klein bottle with error in the predictors. The averages and 
standard deviations over 200 realizations of RASE for different estimators tested on different 
configurations. 
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horizontal orientation, the vertical orientation, and the illumination direction. Thus, 
the data were sampled from a 3-dimensional manifold embedded in M^"^^®^. When we 
view each image as a point in M^^^^"^, the ambient space dimension p = 64x 64 is large, 
so in [1] the authors suggested to rescale the images from 64 x 64 to 7 x 7 pixels in size. 
Denote the resized images of size kx k as {/f where k = 1, . . . , 64. We performed 
200 replications of the following experiment, which is suggested in [1]. Fix k = 7. We 
randomly split into a training set consisting of 688 images and a testing set 

consisting of 10 images. The horizontal orientation of the images in the testing set 
were then estimated based on the training set. Table 3, which summaries the results, 
shows that MALLER improves on the existing methods substantially in the sense 
of reduced RASE average and standard deviation. We mention that NEDEP and 
NALEDEP behave worse than NEDE and NALEDE due to the frequent occurrence 
of blowup in the iteration, and the reported results are the best ones among several 
trials we carried out. 





Isomap face database, k = 7 


RASE 


computation time 


MALLER 

NEDE 
NALEDE 
NEDEP 
NALEDEP 


1.2168 ±0.8131 
1.7852 ±1.2122 
1.7759 ±1.1995 
1.8685 ±1.2413 
2.8095 ±3.6525 


131.5847 ± 17.5136 
34.4606 ± 4.5847 

170.7088 ±28.8193 
53.7212 ±8.3594 

187.3745 ± 31.2623 



Table 3: The averages and standard deviations, over 200 replications, of RASE and com- 
putation time in seconds for different estimators tested on the resized Isomap face data 

r 7-71 698 
l^l Sl=l- 

Next, we carried out another 200 replications of the same experiment but with 
k = 14, 21, or 28. The MLE intrinsic dimension estimate was 3 in all the replications 
when k = 7,14 or 21, and was 4 all the time when k = 28. The results are given in 
Table 4. We mention that when k = 14, 21 or 28, it took long time to compute the 
methods in [1] and the experiment cannot be finished within a reasonable time frame, 
so we decided not to include them in the comparison. When k = 7, 8, . . . , 16, the 
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estimated time (average over 3 realizations) to finish one replication for the methods 
in [1] are plotted in Figure 2, which shows clearly the dependence of these methods 
on the ambient space dimension k x k. 





k = U 


A; = 21 


A; = 28 


Isomap face database, RASE 


MALLER 


0.9865 ± 0.5473 


1.0259 ±0.5098 


0.9369 ± 0.7403 




Isomap face database, computation time 


MALLER 


108.3796 ± 12.0145 


148.9841 ± 20.0436 


164.3576 ± 28.8329 



Table 4: The averages and standard deviations over 200 rephcations of RASE and com- 
putation time in seconds for MALLER tested on the resized Isomap face data {/f}fli, 
k = 14,21,28. 




6 8 10 12 14 16 

k 

Figure 2: The running time for MALLER, NEDE, NALEDE, NEDEP and NALEDEP 
when A; = 7, 8, . . . , 16. The y-axis is in the natural log scale. 

Note, from Table 3 and Table 4, that when k changes from 14 to 7 the RASE 
average of MALLER increases noticeably, and it decreases when k changes from 21 
to 28. In the following are some partial explanations for these. It is clear that 
resizing the images from 64 x 64 pixels to k x k pixels for a smaller value of k causes 
a reduction of the resolution of the images. Taking A; = 1, the extremal case, as 
an example, the images {//}f^^ are scalar values distributed in M, and obviously 
the topological structures of are totally different from that of the original 
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images. This fact indicates that over-resizing the images leads to the distortion of 
the topology, which partially explains the increase of the RASE of MALLER when k 
changes from 14 to 7. Further, the fact that the RASE average dropped again when k 
changes from 21 to 28 may be explained by the reason that, as the estimated intrinsic 
dimension increased from 3 to 4, the extra dimension helps to reduce the estimation 
error introduced by the complex geometric structure when the resolution is high. We 
emphasize that the above explanations for the RASE average fluctuation need to be 
quantified with further analysis, which is out of the scope of this paper and will be 
reported in a future work. 

In conclusion, the Isomap face database example shows the strength of MALLER: 
once the number of observations n is large enough compared with the intrinsic di- 
mension d of the manifold, which may be small compared with the dimension p of 
the ambient space, our method provides improvement over existing estimators from 
both the viewpoints of the prediction error and computation time. 

5.3 Gradient and Covariant Derivative Estimation 

We tested our estimator i*gradm(a;), given in (2.10), on the 2-dimensional torus T 
embedded in via l, which is parametrized by, except for a set of measure zero, 

(p : (m, v) ((2 + cos(f )) cos(m), (2 + cos(f )) sin(u), sin(t>)) , (5.2) 

where {u,v) e I := (0, 27r) x (0, 27r). Considered model (2.1), where X = (j){U,V), 
the regression function m : T — M is given by 

m{(p{u, v)) = cos(m) sin(4t> + 1), 

e ~ A/'(0, 1) and (j{l-^{X)) = ao{l + 0.1 cos{U) + 0.1 sin(l^)) with ao chosen so that 
snrdb= 5 or 40. A direct calculation leads to 

^ sin^('u) sin(4f + 1) — 4 cos('u)^ sin(f ) cos(4f + 1) 

i*gradm(0(u, v)) = - sm{u) cos(m) sin(4f + 1) - 4 sin(u) cos(m) sin(f ) cos(4f + 1) 

y 4 cos('u) cos(f ) sin(4t> + 1) J 

(5.3) 

The detailed calculation of (5.3) can be found in the Supplementary. 
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We sampled 6000 points {{Ui, Vi)}^^^ uniformly from / and then generate {(Xj, Yi)}^ 
according to the above model. Notice that this sampling scheme is non-uniform on 
the torus. Then we randomly picked 3000 points {Xj = 0(f/j, Vi)}f£gooi test- 
ing sample, and compute the gradient estimates {i*gradm(Xj)}^£gQQ^ based on the 
training sample {{Xi,Yi)}^^^. The estimates are visually demonstrated in Figure 3, 
together with the ground truth (5.3) for comparison. 



Figure 3: Gradient estimates. Left: snrdb=40dB; Right: snrdb=5dB. The blue circles 
are the portion of the testingsample {("Ui, Wi)} •'£gooi such that \vi\ < 1 and Ui > 2, the red 
arrows are <.*gradm((/>(nj, fj)) and the black arrows are t*gradm(0(uj, Uj)). 

6 Implications to Manifold Learning 

Another branch of approaches to high-dimensional, massive data analysis are the 
graph based algorithms such as locally linear embedding (LLE) [32], ISOMAP [37], 
Hessian LLE [9], the Laplacian eigenmap [2], local tangent space alignment [42], 
diffusion maps [7] , and vector diffusion maps [36] . In addition to preserving the non- 
linearity of the data structure, one advantage of these approaches is their adaptivity 
to the data, that is, the model imposed on the data is relatively weakened so that the 
information revealed from the analysis is less distorted by model mis-specification. 
These advantages render the graph based algorithms attractive and popular in data 
analysis. When the data are assumed to be sampled from a compact and smooth 
(i-dimensional manifold M, the key step of these methods is the learning of the intrin- 
sic geometric quantities, for example, the Hessian operator [9], the Laplace-Beltrami 
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operator [2, 7] or the connection Laplacian [36]. What we are concerned with in this 
section is the estimation of the Laplace-Beltrami operator A of M, considered in the 
diffusion map framework [7], via MALLER. We refer the readers to these hterature 
for further discussions and references. Throughout this section, we make use of the 
same assumptions and notation as in Sections 2 and 4. 

We start with discussing the relationship between the diffusion map framework and 
generahzing the Nadaraya-Watson kernel regression method to the manifold setup. 
Suppose M is compact, smooth and without boundary. Fix a bandwidth h > 0. First 
we define a n x n weight matrix W and a n x n diagonal matrix D by 

m^,j) = ^( "^'^^'"^' ) and Di^,^) = pW{^,J). (6.1) 

Then A := D^^W can be interpreted as a Markov transition matrix of a discrete 
random walk over the sample points {Xi}^^^, where the transition probability in a 
single step from the sample point Xj to the sample point Xj is given by A{i,j). 

Note that A can be used to generalize the Nadaraya-Watson kernel method orig- 
inally defined for nonparametric regression on MP to the manifold M setup. Indeed, 
given the regression model (2.1), define this generalized Nadaraya-Watson estimator 
niNW of ^ at Xi as 

E"- f II^»-^jIIrp \ Y 
, N / i=l I Vh J ^ . , 

mNw{Xi,h) := {AY){t) = v n \ > « = 1, - • • 

i=i^ \ Vh J 

i.e. take A as the smoothing matrix of rriNwi', Clearly the conditional expectation 
of the estimator niNwiXi, h) becomes 

¥.{mNw{X,,h)\X] = {Am){i) = ^ /.^ J, . , (6.2) 

^i=i^ \ Vh ) 

where m is defined in (2.5). When m G C^(M) and Xj ^ the asymptotic 

expansion of (6.2) has been shown in [7, 18, 35]. Indeed, we have, as n — )■ oo, 

{Am){^) = m(X.) + ^Am{X.) + 2^^^^^|^^) + 0{h^) + 0,{^j^ 

Note that in [7] the kernel is normalized so that fii^ = 1 and ^i^/d = 2. When 
/ is constant, the second order conditional bias term contains information about 
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the Laplace-Beltrami operator of {M,g). This fact, however, is in general ignored 
when the focus is the nonparametric regression problem. On the contrary, since 
knowledge of the Laplace-Beltrami operator leads to abundant information about 
the manifold, in [7] the matrix Lq := h^^{D^^W — In) and its relationship with the 
Laplace-Beltrami operator are extensively studied, and the eigenvectors of A are used 
to define the diffusion map. When / is not constant, the /-dependence is removed 
by the following normalization [7]. Define a n x n weight matrix Wi and a n x n 
diagonal matrix Di by 

n 

Wi = D-^WD-\ and Di{i,i) = J2^iihj) (6.3) 

i=i 

where W and D are defined in (6.1), and 

When n oo, it is shown in [7] that for any m G C^(M) the matrix Li satisfies the 
following convergence: 

(L,m)(.) = ^Am(X.) + 0(/.) + 0.( ^,,, J,,^,,, ) • (6.4) 

Notice that the effect of the normalization (6.3) is actually to cancel out the effect 
of the non-uniformality in / on the matrix Lq. We remark that the matrix D^^Wi 
can thus be used as the smoothing matrix of a new estimator of m which is design 
adaptive. 

If we view the Nadaraya- Watson kernel method on W as the local zero-order 
polynomial regression, the LLR on W can be viewed as the first-order companion of 
the Nadaraya- Watson kernel method which takes the local slope into account [34] . We 
discuss extensively its generalization to the regression on manifold setup in Section 
2, its large sample behaviors in Section 4, and its numerical results are demonstrated 
in Section 5. Recall that the conditional bias of MALLER, given in (A. 30) of the 
Supplementary, depends on the Laplace-Beltrami operator: 

E{m(X, h) - m{Xm = h^Am{X) + 0{h' + hh%l) + O.^^^^j^^). 
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This fact leads us to build up an alternative matrix to approximate the Laplace- 
Beltrami operator. Fix h > and consider the following n x n matrix 



Ap 



(6.5) 



where the i-th entry is defined by (2.6), (2.7), and (2.9). Note that Ap is the smoothing 
matrix of MAILER, that is, ApY = (m(Xi, h), . . . ,m(X„, h))^ from (2.9). Using 
this smoothing matrix and defining 

Lp = h ^ (^Ap — In) , 

for any m G C'^(M), we directly have 

(^pm)(z) = ^Am(X,) + 0{h + h%t) + 0,{-^jl^). (6.6) 

Thus the matrix Lp can be used to construct an estimator of the Laplace-Beltrami 
operator A. Notice that we do not need an extra step to handle the non-constant 
p.d.f. issue here because the design adaptive property of m(X, h) ensures that the 
leading term in the right-hand side of (6.6) is independent of /. With the estimator 
Lp of A, massive data analysis can be carried out in the same way as those in the 
diffusion map framework if the manifold assumption is reasonable. We remark that 
the knowledge of the non-constant p.d.f. is useful in some problems. For example, in 
[7, 28] the authors showed a strong connection between the non-constant p.d.f. with 
the Fokker-Plank operator, which is useful in the low- dimensional representation of 
stochastic systems. 

In Figure 4, some numerical results of estimating the A of M by this new method 
are demonstrated. We sampled 1000, 2000 and 4000 points uniformly from the S"^, 

and S'^ embedded in M^, and MP respectively, and built the matrix Lp from the 
sample points with h = 0.1. It is a well known fact that the l-th eigenvalue of the 
Laplace-Beltrami operator of S'' is —l{l + k — l) with multiplicity [^^^) — {^^l "^), where 
(^) is the binomial coefficient. The results in Figure 4 show that the new estimator 
for the Laplace-Beltrami operator agrees with this well known fact numerically. 

Up to now there are two ways to estimate the Laplace-Beltrami operator: one is 
based on generalizing the Nadaraya- Watson kernel method to the manifold setup as 
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Figure 4: From left to right: bar plots of the first 30 eigenvalues of Lp when the data 
points were sampled uniformly from S^, and S^. Note that the first few eigenvalues 
of A are 0, -2, -6, -12 for S^, 0, -3, -8, -14 for S'^ and 0, -4, -10, -18 for S^, and the 
multiplicities of the first few eigenvalues of A are 1,3,5,7 for S*^, 1,4,9,16 for and 
1,5, 14, 30 for S"^. This fact is well resembled by the corresponding spectrum of Lp. 

suggested by (6.4) and studied in [7], and the other is based on MALLER, which gen- 
eralizes the LLR to the manifold setup, as suggested by (6.6). The difference between 
these two approaches is most obvious when the manifold has smooth boundary. 

Suppose M is compact, smooth and its boundary dM is non-empty and smooth. 
When Xi G M^, the asymptotic behavior of D^^Wi has been shown in the proof of 
Proposition 10 of [7]: 

{D^'W,m){z) = m(Xo) + VhC^dM^o) + 0{h) + Op(^^^^j^j^j^^ , (6.7) 

where Ci = 0(1), Xq G dM is the point on the boundary dM closest to Xi, and 
u is the normal direction at Xq. If the y/h-order term is non-zero, the estimator 
{Lim){i) in (6.4) blows up when h ^ 0. To avoid this blowup and to get an estimate 
of the Laplace-Beltrami operator on M, the Neuman's boundary condition ^ = is 
necessary. Thus, solving the eigenvalue problem of Li is a discrete approximation to 
solving the eigenvalue problem of the Laplace-Beltrami operator with the Neuman's 
boundary condition. 

The situation is totally different for the proposed estimator Lp. The asymptotic 
behavior of the conditional bias of MALLER at Xi G provided in Corollary 4.1 
leads to 

(L,m)(z) = -J2ckiX,)Vl^,miX,) + Opih-'/'h%l + hlH) + 0,(^™^). (6.8) 

k=l 

Thus, we know that when X^ is near the boundary, the estimator Lp does not blow 
up when /i — )• 0, and a different boundary condition can be imposed. 



28 



Notice that the importance of using different bandwidths in the tangent plane 
estimation and in the LLR on the tangent plane becomes clear from (6.6) and (6.8). 
Indeed, if we take hpca < h then it follows from (6.6) (resp. (6.8)) that the first order 
error of the estimator for the Laplace-Beltrami operator inside the manifold is smaller 
than the order h^^'^ (resp. h^^'^). 

In Figure 5, we demonstrate the eigenvectors of the estimator Lp for the Laplace- 
Beltrami operator of a manifold with boundary. Specifically, we sampled 2000 points 
{Xi}f^^ uniformly from the interval [0, 1] embedded in M, and evaluated the eigen- 
vectors of Lp built on {Xi}f^^. Notice that the eigenvectors shown in Figure 5 can 
not happen, except for the first one, if the Laplace-Beltrami operator satisfies the 
Neuman's condition. The survey of the boundary condition suitable for the estimator 
Lp is out of the scope of this paper, and we leave it as a future work. 
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Figure 5: From left to right: the first four eigenvectors of Lp and the first 10 eigenvalues 
of Lp when sampling from [0, 1]. The first two eigenvalues are zero. Notice that the second, 
third and fourth eigenvectors can not happen if the Laplace-Beltrami operator satisfies the 
Neuman's condition. 



7 Discussions 

When the p-dimensional predictor vector X has some d-dimensional manifold struc- 
ture, we obtain MALLER by constructing the traditional LLR on the estimated em- 
bedded tangent plane, which is of dimension d instead of p. Consequently, both the 
estimation accuracy and computational speed depend only on d but not on p. Keeping 
p, d, n as fixed numbers, this feature is particularly advantageous when d <^ n < p, as 
is shown in the Isomap face database example in the numerical section. We mention 
that MALLER works in this case hinges on the capability of estimating the tangent 
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plane. Since our model is noise free in the predictors, this capability can be explained 
by the theoretical findings in [19] and [27]. In [27], the spike model is studied and the 
recovery of the subspace spanned by the response vectors is guaranteed even if p > n, 
when there is no noise [27, (2.13)]. Under the manifold setup, locally the manifold 
model behaves like the Euclidean space, so it is expected to have similar results as 
those in [27], which is shown in [19]. Furthermore, we emphasize that, while in [1] this 
case is modeled as the large p small n problem, where p grows with n, and sparsity 
conditions and thresholding are employed, here we treat p as a fixed number and take 
the fact that n is larger than d. 

7.1 The Relationship with NEDE 

MALLER is not the first LLR regression scheme proposed to adapt to the mani- 
fold structure. NEDE, given in [1], is a manifold-adaptive LLR constructed in the 
p-dimensional ambient space with regularization imposed on the directions perpen- 
dicular to the estimated embedded tangent plane. At the first glance MALLER seems 
to be a special case of NEDE [1, (4.6)] by taking A„ = oo in [1, (4.6)]. However, there 
are several distinct differences between the two methods. In this section we follow 
the notation used in [1]. 

First, when A„ = oo for all n, although /3 in [1, (4.6)] is forced to be located 
on the estimated embedded tangent plane, the NEDE algorithm still runs in the 
ambient space and the minimization problem in [1, (4.6)] becomes ill-posed. Indeed, 
the solution in [1, (4.6)] depends on the inverse of the matrix C„ + A„P„/n/i'^"''^, 
which is unstable to solve when A„ = oo. This numerical instability of NEDE when 
A„ = oo can also be shown numerically. As an illustration, we ran NEDE with 
A„ = e^°° (within the machine precision) on the Isomap face database with the images 
downsized to 7 x 7 pixels. Then, it happened that the optimal value of d chosen by the 
NEDE algorithm was close to 49 = 7 x 7 = p (48.325 ± 1.3019 over 100 replications) 
due to the degeneracy of Cn + XnPn/nh'^+^, and the final RASE was 12.3684 ± 6.1161 
(over 100 replications), which is roughly ten times of the RASE of MALLER. Even 
when we set c? = 3 and A„ = e^°° in the NEDE algorithm and tested it on the same 
7 X 7-pixel images, the final RASE was still 10.5829 ± 6.0986 after 100 rephcations. 
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Second, even if NEDE [1, (4.6)] is stable to solve when A„ = oo, the bandwidth 
selection problem in NEDE still depends on p, which leads to different results com- 
pared with MALLER. Specifically, the selected bandwidth would be larger and hence 
the bias is increased. 

Third, in NEDE the bandwidth used in the tangent plane estimation is taken to 
be the same as the one used in the LLR estimation, while in MALLER we estimate 
the tangent plane using a different bandwidth hpca which by the asymptotic analysis 
should be taken to be smaller than the bandwidth h in the LLR step. Thus, the tan- 
gent plane estimate obtained by NEDE is different from that obtained by MALLER. 
Since this estimation error does not contribute to the leading bias term, the differ- 
ence is not significant in the regression problem. However, if we would like to have a 
better estimator of the Laplace-Beltrami operator, this error becomes significant, as 
is shown in Section 6. 

In conclusion, MALLER is different from NEDE even if the parameter in NEDE 
is set to oo, both theoretically and numerically. And, the key features that render 
the two algortihms different are those mentioned above, not the more sophisticated 
method MALLER uses to select the bandwidth in the LLR. 

7.2 Future Directions 

To sum up this paper, here are several issues left open and are of interest for future 
research: 

1. Like in any smoothing methods, bandwidth selection is crucial for the proposed 
MALLER. Our bandwidth selection procedure is built on balancing between 
estimates of the conditional bias and variance. Although this approach worked 
well in our numerical studies, there is still room for improvement. 

2. We include in our algorithm a clustering tool to alleviate numerical problems 
caused by the condition number, without having to estimate the condition num- 
ber. This is not the ultimate solution; instead, the ideal solution is to estimate 
the condition number, and then use that information in the subsequent steps. 



31 



3. In this paper we consider the case where the predictor vector is directly observ- 
able. In some situations, the predictor vector itself is subject to noise, and the 
tangent plane and regression estimation steps has to be adjusted accordingly. 
This is closely related to the deconvolution and measurement error problems in 
the literature, in the Euclidean setup. 

4. In MALLER, the dimensionality is reduced to the intrinsic structure of the 
predictors. The dimensionality may be further reduced by taking into account 
the relationship between the response and the predictors [38, 39]. 

5. The smoothing matrix of MALLER is shown to be useful for estimating the 
Laplace-Beltrami operator with the boundary condition different from Neu- 
man's condition, it is worthwhile to investigate further such a new set of tools 
for manifold learning. 

6. In applications, the response itself may be multivariate as well. The case when 
the responses are positive-definite matrices and the predictor vector is non- 
degenrated in W was considered by [43] . It is interesting to investigate the case 
when both the response and the predictor vector have manifold structures. 
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Supplementary Materials for "Local Linear Regression 
on Manifolds and its Geometric Interpretation" 

by Ming- Yen Cheng, and Hau-Tieng Wu 



A.l Exterior derivative, covariant derivative and 
gradient 

In this appendix we provide the required differential geometry background about the 

covariant derivative, gradient, exterior derivative and their relationships. We refer 

the readers to [8] for more details. 

We start from recalling the definition of the gradient vector field of a given function 

defined on the Euclidean space. Given m : M*^ — )■ M, the gradient vector field or the 

total differentiation, denoted as Vm is defined as 

^ / dm dm\ 
\dxi"''dxd) 

so that for v we have the directional derivative 

V„m(x) := (Vm)(f ) := lim ^ j —. (A.l) 

Often we use another notation to represent the directional derivative: 

(Vm(x),t;) := V^,m(x) (A. 2) 

This definition, however, can not be generalized to the manifold setup directly. In- 
deed, the quantity x + tv in (A.l) does not make sense in general. To obtain a suitable 
notion of differentiation, we consider the following definitions. Fix a differentiable d- 
dim manifold M and a function m : M — )■ M. For a given differentiable vector 
field V , locally around x G M we can find a curve c{t) so that c(0) = x G M and 
c'(0) = Vx, the value of at x so that V acts on m at x by 

Vm{x) := . (A.3) 

^ ^ dt i=o ^ ' 

The exterior derivative of m, denoted as dm at x is defined as: 

((dm)V)(x) := ((dm),, V^) := Vm{x), (A.4) 



where (■, ■) means that the first entry is the dual of the second entry. We can thus 

view the exterior derivative of m as a 1-form, which maps a given vector field into a 

scalar valued function. Next we define the covariant derivative of m, denoted as Vm. 

Fixed a curve c{t) on M so that c(0) = x. The covariant derivative of m in the 

direction of c'(0) is defined as 

„ Pc{o),c(i)m(c(t)) - m(c(0)) 
Vc'(o)m := lim , 

where Pc{o),c{t) is the parallel transport of the trivial scalar bundle. Since Pc{o),c{t) is 

trivial, the covariant derivative of m in the direction of c'(0) is reduced to 

„ m(c(t)) -m(c(0)) dm(c(t)) , , 

Vc'(o)m = liHi \^ — = = Vm{x). (A.5) 

Thus Vm Is a 1-form, which maps a given vector field to a scalar value. If M is 
Riemannian, that is, M is endowed with a Riemannian metric g, we can further 
define the gradient of m, which is a vector field denoted as gradm, as: 

5((gradm(x),14) = ((dm)^,14). (A. 6) 

It is clear from (A. 4) and (A.5) that for a given differentiable function m, its 
exterior derivative and covariant derivative are the same. Notice that from (A.l) 
and (A.5), the covariant derivative of m defined on M is a natural generalization of 
the total derivative of m defined on the Euclidean space. In other words, the total 
derivative of m defined on the Euclidean space should be viewed as a 1-form. The 
gradient defined in (A. 6) is directly related to the covariant derivative via the metric 
g. This definition is exactly the same as that in (A. 2) since in the Euclidean space, 
the metric g in the local coordinate {di}f^i around x is nothing but (fi'ji)i<jj<^ = ^d, 
where gij := g{di,dj). In other words, if we view the Euclidean space as a manifold 
with the canonical metric, we can either view the total differentiation as a 1-form, the 
covariant derivative (A.l), or as a vector field, the gradient (A. 2); but in the manifold 
setup, these two notions are not exactly the same but related by the chosen metric g 
as in (A. 6). 

With the above definitions and clarifications, for a fixed local coordinate around 
X, we have 

d 

gradm = ^ g^^dimdj, (A. 7) 



where {di}f^i is the coordinate around x, diUi is defined by (A. 3) and 
the inverse of {gjk) 



l<i,j<d 



IS 



l<i,j<d' 



while the covariant derivative of m is 

d 

dm = Vm = dimdx\ 
1=1 

where {dx^jf^^^ is the dual of {di}f^^. Thus, if we choose a normal coordinate around 
X so that Qij = 6ij at x, where 6ij denotes the kronecker delta, the coefficients of the 
covariant derivative of m at a; is the same as the coefficients of the gradient of m at x. 
Note that gradm(x) (or dm(x)) is the same regardless the choice of the local basis. 

Notice that as is stated in Theorem 4.3 and 4.4, the estimated first order covariant 
derivative of m, Va.m(x, h), depends on the estimated basis of ^^kT^M. Thus, we have 
to take this basis into account to estimate the embedded gradient of m, L^:'Vm{x), as 
is considered in (2.10). Notice that since MALLER provides the estimate of Vdi^TL at 
X for I = 1, . . . ,d, we can get the estimate of the covariant derivative or the exterior 
derivative of m by taking the dual basis of {di}f^i into consideration. 

We demonstrate the detailed calculation of the gradient given in (5.3). Since 
(j){u, v) = ((2 + cos(f )) cos('u), (2 + cos(f )) sin('u), sin(f )), It is clear that 

-(2 + cos(f )) sin(-u) — sin(t>) cos('u) 
d(f) = (2 + cos(i;)) cos(n) — sin(t>) sin(M) 
cos(f) 
By denoting ci = (1, 0) G and 62 = (0, 1) G M^, we get a set of embedded vector 
fields defined on 0([O, 27r) x [0, 27r)): 

d0(ei) 



and 



El 



d0(e2) 



I|d0(ei) 



sin(M), cos('u), 0) 



sin(t>) cos(m), — sin(t;) sin(M), cos(f )), 



\me2)\\ 

which are orthonormal with related to the canonical metric of M.^. Since l is an iso- 
metric embedding of the torus into M^, Ei = t^^di, i = 1,2, where di is an orthonormal 
frame defined on the torus. Thus, by (A. 7) the embedded gradient of m at l{x) can 
be evaluated by 

6^,(gradm(x)) = di'm{x)L^:di{x) + d2m{x)L^d2{x) = dim{x)Ei{x) + d2m{x)E2{x), 

(A.8) 



where di{x) is the value of di at x. By definition, we have 



d2m{x) 



dim{x) 



(\m{ci{t)) 
dt 

dm(c2(t)) 
dt 



t=o — 



dt 

dm(0(M, V + 1)) 



4 cos(m) cos(4t> + 1) 



sin(M) sin(4t' + 1) 



2 + cos(f ) 



t=o — 



where l{x) = f), Ci(0) = x and c^(0) = 9j(a;) for i = 1,2. Note that d0(ei) is 
not of unit norm, so we have to normahze ei by 2 + cos(t;) when we evaluate dim{x). 
Plugging the above into (A. 8), we get (5.3). 



The following lemmas are needed to finish the proofs of the theoretical results. The 
proofs of the first three lemmas can be found in [36]. The first lemma describes how 
the volume form depends on the curvature. The second lemma describes how to 
express the relationship between two points on the manifold M after being embedded 
in MP. Recall that the notion of "subtraction" between two points on M is not 
well defined. However, once these two points are embedded to W, the notion of 
"subtraction" makes sense, and the result of subtraction can be expressed by some 
geometric quantities of M and the embedding itself. The third lemma describes the 
error when we try to estimate the geodesic distance between two close points on M 
by the Euclidean distance between their embedded points. Notice that in practice 
the geodesic distance between two close points on M is unknown a priori, and we can 
only estimate it by the Euclidean distance between their embedded points. 

Lemma A. 2.1. In polar coordinates around x E M, the volume form dV is 



where 6 G T^M, \\e\\ = 1 andt>0. 

Lemma A. 2. 2. Fix x G M and denote by exp^ the exponential map at x. With the 
identification of Ti_(^x)^^ with W, for 9 G T^M with ||0|| = 1 and t <^1, we have 



A. 2 Proofs 



dV^(exp^t0) = {f^-^ + f^+^Rzcie, 6) + O(t'^+2))dtd0 



<.(exp^t^) =L{x)+ti^O + t 



+ o{e). 



(A.l) 
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Lemma A. 2. 3. Suppose x,y & M such that y = exp^(t9), where 9 G T^M and 
\\9\\ = 1. Ift ^ 1, then i = \\l{x) — L{y)\\KP <^ 1 satisfies 

t = i+^\\lU9,9)\\P + Oii'). (A.2) 

By combining the above lemmas, we get the following two lemmas. In Lemma 
A. 2. 4, we quantify the volume error introduced by estimating the geodesic distance 
between two points x,y G M by the Euclidean distance between l{x) G MP and 
L{y) G MP. In Lemma A. 2. 5, we collect some routine calculus. 

Lemma A. 2.4. Fix x E M and < S <^ 1. For Vi G S^^^ , i = 1, . . . , i, we have 



UU{y-x,v.^dViy)= / UU{y - x,v,)dViy) + 0(5 

Bfix) JBfix) 

where 

Bf{x) := L-^ {BY{x) n l{M)) C M. 
In particular, the volume of B^{x) differs from that of Bg^[x) by 0(6'^^'^). 

Proof. By direct calculation: 

^li{y-x,Vi)dViy) 

/ / UU,{ttJ + 0{t^),Vi) [f"-' + 0(t^+i)] d9dt 

Jo Js^-i 

[ [ nti {tt,9 + 0(^2) [t^-^ + Oif^^)] d9dt + 0(6" 

Jo Js<i'^ 

nti(y-x,t;.)dV^(2/) + 0(5^+'+^), 



Bf{x) 
S+0{P) 



+l+2\ 



'Bf{x) 

where the first equality comes from Lemma A.2.1, Lemma A. 2. 2 and Lemma A. 2. 3 
and the others comes from direction calculations. □ 

Lemma A. 2. 5. Fix x G M\M^, where h <^ 1, v E M^ , a function G C^(M) and 
the kernel function K compactly supported in [0, 1] so that -ft'lp,!] G (^""^([0, 1]). Then 
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for each £ G N we have: 

(a) EKiiX, x)<f){X) = /i,,o/(x)0(a;) + 0(/i); 

(6) EKi{X,x)iX-x)^iX) 
d 

= V<?,2| ^ [0(a;)i*5/V9j(x) + f{x)L^diVai(l){x) 

(c) E(ir^(X,x)(X-a;)(X-xf0(X))^ . 

_ / + 0{h^) when l<i=j<d 

I 0{h^) otherwise 

(d) EKi{X,x){X -x){X- xf{X - X, v)(f){X) 

Proof. These expectations are evaluated by Taylor's expansion and by Lemma A. 2.1 
to Lemma A. 2. 4. We start with evaluating (a). 

EKiiX, x)0(X) = / Kiiy, x)cP{y)f{y)dV{y) 
= [ Ki{y,x)<P{y)f{y)dV{y) + 0{h) 




X (^/(x) + tVef{x) + Oif)^ {f^-^ + O{t'^+^))dtd0 + 0{h) 



= fJ'£,of{x)4){x) + 0{h), 

where the first equality comes from (4.3), the second equality comes from Lemma 
A. 2. 3 and Lemma A. 2. 4, the third equality comes from the Taylor's expansion and 
Lemma A. 2.1 and the last equality comes from the symmetry of S'^^^. Indeed, the 
odd moments in the integral vanish because S'^~^ is symmetric. 
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Next, by the same arguments as those leading to (a) and Lemma A. 2. 2, the left 
hand side of (b) becomes: 



¥.Ki{X,x){X-x)<P{X)= / Ki{y,x){y-x)<P{y)f{yy'^V{y) 
Kiiy, x){y - x)<l>{y)f{y)dV{y) + 0{h'/') 

X + tV,0(a;) + Oit')) (/(x) + tV,/(a;) + 0(1^)) 
X (t'^-i + Ric(^, e)t'^+' + 0{t'^+^))dtde + ©(/i^'/') 

= h [ I K'{t) U{x)L,dVef{x) + /(x)6.^V,0(a;) 
Js''^-^ Jo ^ 

^^UMim^^y^MtdO + Oihl). (A.3) 

A direct calculation shows that 

/ eVef{x)de = J2 ^^VaJ{x) [ O'e'dO = ^^5,VaJ(a;). (A.4) 

l,k=l -JS^-' ^ 1=1 

By plugging (A.4) into (A.3) we conclude (b). 

By the same arguments as those leading to (b), we get (c): 

E {Kl{X, x){X -x){X- xfct>{X))^^^ 

Kiiy,x)iy ~x)iy- xf<Piy)fiy)dViy) 

LJ^'-''i<if)^<7f))'^''--''-'' '''''' 

X ((j){x) + tV,0(x) + Oit^)) (fix) + tVefix) + O(t') 



X (t'^-^ + mc{e, e)t'^+^ + o{t'^+^)'^ dtde + o{h^) 

hf{x)(P{x) [ [ K (t) LJiLjff^+^dtde + 0{h^) 
is^-i Jo 

h^f{x)(j){x) + 0{h^) when l<i = j<d 
Oih"^) otherwise 

where the last equality comes from the fact that is linear. 



(A.5) 



Equation (d) follows from the same arguments as in the above: 

^Ki{X, x){X - x){X - xf {X - x,v)(f){X) 

Ki{y,x){y - x){y - xf{y - x, v)(j){y)f{y)dV{y) 



d-l 



+0(t''+^)|dM^ + 0(/i^/2) 
+0{h''/^). 



□ 



Next we describe how the local PCA provides the estimate of the tangent plane. 
Although locally a manifold M is close to some Euclidean space, there is always a 
gap caused by the curvature of M. Lemma A. 2. 6 states its influence on the tangent 
plane estimation by the local PCA. 

2 

Lemma A. 2.6. Suppose hpca n^^+r. Then, if x G M\M^, the eignvectors 
{Ui{x)}f^-j^ corresponding to the d largest eigenvalues of the sample covariance ma- 
trix formed in the local PCA differ from an orthonormal basis {dk{x)}f^i to T^M 
by: 

Ui{x) = LAix) + O.ihli'Jwi + 0,ih%'Jwl forl = l,...,d, (A.6) 
where Wi G l^^T^M, wf- ± l^^T^M, and \\Wi\\ = \\wf-\\ = 1, and, if x G M^, 

Ui{x) = lM^) + 0,{hl'^)wi + 0,{hy;^,)wi forl = l,...,d, (A.7) 

where Wi G l^T^M, wf- ± l^T^M, and both Wi and wf- are ofO{l). 

_ 2 

Suppose hpca ^ 0{n '^+'^) and x G M\M^, then a better convergence rate is 
achieved. Indeed, (A.6) becomes 

Ui{x) = lM^) + Op{hlil)wi + Op{hpca)wf- forl = l,...,d. 
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The proof of this lemma follows the same lines as those in [36] except some wrinkles 
caused by the two differences mentioned above. We now detail these wrinkles and 
refer the readers to [36] for the detailed proof. 

Proof. Fix X G M\M^. Choose a normal coordinate {^^(x)}^^]^ around x and assume 
M is properly rotated and translated so that x = Opxi and e,j = L^:di{x), for i = 
1, . . . ,d, where Opxi is the p x 1 zero vector and is the unit length p x 1 vector 

with the i-th entry 1. Denote := x^ep (^)p^(i^-)(X)X, where x is the indicator 

function. 

For later use, we prepare some calculations. First, since / G C^(M) and M is 
compact, by plugging i = 1 and Vi = Si into Lemma A. 2. 5 and taking Taylor's 
expansion, we have 



E{Z^,ei)= / {y,ei)f{y)dV{y) {A.t 

'fiM {x) 

//in -- 



''•pea 

I'pca. 



S'*-! Jo 
■-0{hlc.') 



[tLj + 9), (fix) + tVefix)) f'-'dtdO + 0(4^^') 



Similar calculation leads to: 

I 0{hpca ) otherwise. 

With (A. 8) and (A. 9), we can finish the proof. Recall that the sample mean of 
■^xTpca i^ denoted by fi^- Then, it follows from the Central Limit Theorem (CLT) 
and (A. 8) that 

(/. e,) = -^(X ^^^^l 0{h%l-'') + 0,{n^h%l'-') ifl = l,...,d 
^ fc=i *' 1 0{hpil^^) + Op{n^hpit^^) otherwise. 

Since hpit^^ dominates n~^/'^hpit^^ asymptotically, due to the assumption hpca x 

2 

n , we conclude that 

f^. = Op{hiil-^'). (A. 10) 
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Next we consider the sample covariance matrix S^.. By (A. 8), (A. 9), (A. 10), and 
similar calculation as in the above, we have 



^ 1=1 

E(Z,, e,){Z,, e,) + 0^ + Op{n^ htit'^^) otherwise, 

where the second Op term comes from the finite sample variance. By (A. 9) and the 



if 1 < i,j < d 
iid+l<i,j<p 



assumption hpcs, ^ n we get 



d 



[ud/2+l 
"pea 



Op— dxd Op— rfxp— d 



+ 



l/2x 
pV-pcaJ 



Op{h 

Op{hpca, 



^p(^pca) 



where Omxm' is the zero matrix of size m x m', for any m,m' G N. As a result, we 
get the equation (B.44) in [36]. Then we can analyze by the perturbation theory 
exactly in the same way as in [36], so we skip the details. When x G M^, the same 
calculation applies and we skip the details. □ 

Before proving Theorem 4.1 and Theorem 4.2, we prepare some notation and 
setups. Fix X. Recall that is a pxd matrix with the k-th column Uk{x) determined 
by the local PCA. Denote y := B^{y — x) and Xi := B^{Xi — x), where y E M and 
Xi E X. To simplify the notation, we denote 

:= i?a;Hessm(a;)i?J, 
6, := diag(a2(6-i(Xi)), . . . , a^ir^X^))), 
0.m{x) := [xiB.essm{x)xi ... x^B.essm{x)Xn]'^ ■ 
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For a given function : M k-)- M, ^ g N and v eMJ', we define 

:=EKiiX,x)^iX), 

■.= EKi{X,x){X-x)^{X), 

:=EKiiX,x)iX - x)iX - xf<P{X), 

■= EAl(X, x){X -x)iX- xf{X - X, t;)0(X), 
1 " 

qi := - y^Kh{Xi, x)xf}iessm{x)xi, 
1=1 
1 " 

q2 := - KhjXi, x)xf}iessm{x)xiXi. 
^ 1=1 

A. 2.1 [Proof of Theorem 4.1] 

Proof. Fix x G M. Denote by {f/fc(a^)}fc=i the orthonormal set determined by local 
PCA. Choose an orthonormal basis {ek}^^i of MJ', where is the p x 1 unit norm 
column vector with the fc-th entry 1, and assume t is properly rotated and translated 
so that X = Opxi and Cj = L^di{x) for i = 1, . . . , (i, where Opxi is the p-dimensional 
zero vector. 

With the notation Y and m defined in (2.5), clearly we have 
E{rn{x,h)\X] = vl{Xl^.,^.,)-^^l^,EY = i;nX^W,.X,.)-'X^W,m. (A.ll) 
Take y = exp^{t9), where t = 0{h^^^) and \\9\\ = 1. By Lemma A. 2. 2 we have 

= i{y) -X- t^lUe, 6) + 0(^3), (A.12) 
which by Lemma A. 2. 6 leads to 

{i.e, Uk{x)) = {i,e, L^dk) + Op{h%i), (A.13) 

since w-^ is perpendicular to l^6, and 

{iUe,e),Uk{x)) = o,{h%t), (A.14) 

since the second fundamental form 11^ is perpendicular to the embedded tangent 
plane t^T^M. Therefore, for j = 1, . . . , (i, we have 

{tij, e,) = {UJ, U,{x) - Op{h%l)w,) (A.15) 
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= {y-x, u,{x)) - -{lUe^ei u,{x)) + o,{h'i'h%t} 

= {y-x, U,{x)) + 0,{hh%t + h'/'h%l) 
= y^ + 0,{hh%t), 

where the first equahty holds due to Lemma A. 2. 6, the second equahty holds due to 
(A. 12), the third equality holds due to (A. 14), and the last equality holds due to the 
assumption that hpca. < h. By Taylor's expansion on M, (A. 15), and the assumption 
that hpca < h, 



m{y) — m{x) 

teVm{x) + -Hessm(x)(^, 6) + 0{t^) 
2 



(A.16) 



d 1 



.T 



1 3 

Vm{x) + -y Hessm(x)t/ + Op{hh^ca), 

where the second equality is obtained by rewriting 9 = Ylk=i9(^^^k{x))dk{x) = 
J2k=i(^*^^ ^k)dk{x), because t is isometric. Since the kernel K is compactly supported, 
m is bounded, and M is smooth and compact, (A.16) leads to 



m[x) 
Vm(x) -' 



-£lm{x) + Op{hh-, 



3 

pea; I , 



(A. 17) 



where X^; is defined in (2.6) and is defined in (2.7). By plugging (A. 17) into 
(A. 11), the conditional bias is reduced to 



E{m(x, h) - mix)\X} = vJ{XlW^X^)-^Xi:W^i£lmix) + Op(/i/i|ca)). (A.18) 



Now we evaluate (A.18). By direct expansion, we have 
1 



n 



nEi=iKH{X,,x) '-Ei=iK,{X,,x)xf 



YJi=iKh{Xi,x)xi -Y,'i^iXiKh{Xux)xl 
Denote by 1 the constant function with value 1. By the CLT, we have 



1 1 

-Y^K^{Xi,x) = €l{l) + 0J— 



712 n't 



(A. 19) 



(A.20) 
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1 " 1 

- J2 Kh{Xu x)xi = B^elil) + O, (^-r^ 
n ^n^hi 2 



(A.21) 



and 



1 " 1 
-Y^x,K,{Xi,x)xf = B^€l{l)B, + 0J^-r-). (A.22) 
^ ^^-^ ^n^hi ' 

Note that in (A.21), the random variables {K^iXi^ x)xi\'^^^ are not independent since 

Xi = B'^{Xi — x) and B^ is evaluated from the random samples {Xi}^^^, and hence 

the CLT can not be applied directly. However, once we rewrite the left-hand side of 

(A.21) as -BJ {^Yl^=iKh{Xi,x){Xi — x)) , the summands become independent, and 

the CLT can be applied. The same comment applies to (A.22). The expectation in 

(A. 20) is clear from Lemma A. 2. 5. The expectation in (A.21) becomes 

d 

d 



h^BlY.,.d,VeJ{x) 



+hl / ;^(,) ^Jii^(M)/(x) ^,^,^^^^^^^^| 



d 

h^Bl ^A^dJ'ix) + Opihhic.) + 0{h 



= h^Vf{x) + 0,ihl), 

where the first equality holds due to Lemma A. 2. 5, the second equality holds due to 
(A. 14) and the third equality holds due to (A. 13) and the assumption that /ipca < h. 
Similarly, the expectation in (A.22) becomes 



hf{x) / / K{t)ee'^t^+^dtde + Op{hhjc^) + o{h^ 

Js-i-^ Jo 
h^f{x)h + Op{h'), 



where the first equality comes from Lemma A. 2. 5 and (A. 13). As a result, (A. 19) 
becomes 



n 



-x^w,.x,. 



+ 



fix) h^Vfixf 
hi^Wfix) hi^fix)!, 

0{h) + 0,(^^) 0{hy^) + o,( „,,.,,v.,. ) 



1/2 



0{h^) + o. 



1 



PV nl/2/id/4-l 
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Since /i — > and nh'^^'^ — )■ oo as n — > oo, we know ^X^W^X^. is invertible with 
probability tending to 1 as — )■ oo. Also, since f{x) + 0{h) + Op ( ^1/2^^/4 ) and 
f{x)Id + Oih?) + Qp( ,^i/2;^^d/4-i ) are also invertible with probability tending to 1 
as n — )■ cxD, by the binomial inverse theorem, 



-X^W.X, 



n 



fix)-' 



-f{x)-'Vf{x) h-' 



d J 



+ 



0{h) + Op^^i/2;jd/4 
0{h^^^) + Opi ^i/2^d/4+l/2 



0(1) + 0„ 



2/jd/4+l/2 
1 



p I „l/2^d/4+l 



Next we consider -X'^W x0.m{x) . By a direct calculation. 



qi 
q2 



Note that, for any n x n matrix Z and any n x 1 column vector v, 

v'^Zv = tr(Zff^). 

By (A. 25) and the CLT, we have 

1 " 

qi = -y^Kf,{Xi,x){Xi-xf^{Xi-x) 
1 " 

= ti[sj-J2KhiXi,x){Xi-x){Xi-xf^ 

1=1 

= tr{ml{l))+Op 
We evaluate tr(i3(£2) by 



1 



= hf{x) [ [ K{t)e^}iessm{x)et'^+'dtde + 0{h^) 

Js'i-^ Jo 

= h^f{x)Am{x) + Op{h'), 



(A.23) 



(A.24) 



(A.25) 



(A.26) 



(A.27) 



where the first equality comes from Lemma A. 2. 5, the second equality comes from 
(A. 13) and (A.25) and the last equality holds due to the symmetry of S'^~^ and the 
definition of the Laplace-Beltrami operator. 
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Then we evaluate q2 in (A. 24). Choose {6^}^^^ as an orthonormal basis of and 
rewrite Xi — x = J2k=ii-^i ^k)^k- Note that the random variables Kh{Xi, x){Xi — 
x){Xi — xY {Xi — X, Cfc) are independent. By (A. 25) and the CLT, 

1 " 

q2 = -^i^,.(Xi,x)tr(^(Xi-x)(X;-x)^)sJ(X;-x) (A.28) 
1=1 

p 1 " 

= B^J2tr(^S)-J2KhiXi,x)iXi - x)iXi - xf {Xi - x,ek))ek 

k=l 1=1 
P ^ 

= B^J2^r(mU,il))ek + 0J—r^). 

By the same arguments as those for qi, we have 
p 

5j^tr(i54,,(l))e, 

k=l 



2 /^1,2 



where the first equality holds by Lemma A. 2. 5 and the second equality holds by 
(A.13), (A.14), (A.25) and (A.28). 
As a result, (A. 24) becomes 



n 



+ 



h^f{x)Am{x) 
h^j§^\ ^^Hessm(x)^^V,/(x)d^ 



(A.29) 



Lastly, since m G C^(M) and M is compact, a simple uniform bound combined with 
(A. 23) yields that the remainder term in (A.18) is Op{hh%t). Plug (A.23), (A.29) 
and this result into (A.18), we conclude that 

E{m{x, h) - m{x)\X} = h^^Am{x) + Opih' + hh%l) + oj— (A.30) 
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Next consider the conditional variance. A direct calculation gives 
Var{m(a;, 



(A.31) 



n \n 



1/1 



n 



n 



By the CLT 



n 



-X^W,6,W,X, 



}XtiKl{Xux)x,xJa\X,) 



+ 



Or 



1 



Or, 



„l/2^3d/4 
1 



Or 



1 



P \ „l/2/i3d/4-l/2 



Or 



1 



'P V nV2/i3d/4-l/2y ^pyri^/2^3d/4,-l 

We evaluate the expectations by the same arguments as those above and get 



n 



X^W,6,.W,.X, 



h 2 



hvl hd~^H2,20-'^ix)f{x)Id 



(A.32) 



+ 



0{h) + 



Opih^ + hh%l) + o 



0,{h^ + hhl^)+oJ^^ 



p\ 1 d 1 



Op{h^) + o 



P \ Id 



where i;^, 



/^2.20-(a:^) 



[2/Va + (TV/](a;). Due to (A. 23) and (A.32), (A.31) becomes 



Var{m(x, /i)|A'} 



1 /i2,oo-^(a;) 



O 



+ 



(A.33) 



/(x) ''\n¥/^-^ n3/2;i3d/4y 

Thus, the asymptotic conditional MSE in (4.5) follows from (A. 30) and (A.33). In 
conclusion, when hp^a < h, the minimal asymptotic conditional MSE is achieved 
when n/i'^/^ x /i"^, as is claimed. Note that hpca and h are thus related by /ipca = 

^(d+4)/(d+l) ^ 

The conditional bias of the estimator Va.m(x, h), ior i = 1, . . . , d, are evaluated 
by following exactly the same lines as in the proof of (A. 18): 



E{Va,m(x, h) - Va.m(x)|A'} = v^.i^l' 



(A.34) 
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By plugging (A. 23) and (A. 29) into (A. 34), we obtain 

E{V^(x, h) - VQ^m{x)\X] (A.35) 

d f{x) i|/(x) 
+ 0(0 + h^4c^ + Op(^^) . 

The conditional variance term of VQ^m{x,h) comes from (A. 23) and (A. 32): 

Var{ V^(x, h) \X} (A.36) 

1 ^/^^,^-^(-)/(-)+oJ^Uo/ ' 



nh'^/^+^ /ii,2 "Knh'^l^) ^\n^/^h^dl^+^)' 

The conditional MSE is then obtained directly and it leads to the conclusion that the 
minimal asymptotic conditional MSE is achieved when n/i*^/^ x h^^. □ 

A.2.2 [Proof of Theorem 4.2] 

Proof. The proof is smilier to that of Theorem 4.1 except the boundary effect. We 
use the same notation {Uk{x)}'l^^, {6^}^^^^ as those in the proof of Theorem 4.1 and 
the same assumption for l. Note that the equalities (A. 11) and (A.31) still hold. Take 
y = exp^ tO e M, where t = 0{\/h) and ||^|| = 1. By Lemma A.2.2, Lemma A. 2. 6 
and (A. 12), we have for j = 1, . . . ,d 

{UJ, e,) = {UJ, U,{x) + Op{h%l)wj) (A.37) 

= {,{y)-x,U,{x)) - ^^{ime),U,{x))+0,{h%th'/') + 0{h'/') 

= y, + o{h%ihy^ + h%lh), 

By the same arguments as that in (A. 16) and by (A.37), we have 
m{y) - m{x) = tOVm{x) + — Hessm(x)(6', 6) + 0{1?) 



d 

^ |2 

j=l i,j=l 



d 1 

^(ti^^,ej)Va^.m(x) + - {ti^O, ei){tL^9, ej)Hessm(x)((9^, dj) + 0{hi 



■.y^Vm{x) + ^t/^Hessm(x)?/ + 0,ih%lh'/' + hlHh), 
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which leads to the following equality 



m[x] 



since the kernel K is compactly supported. By a direct calculation, the conditional 
bias is reduced to 

¥.{m{x)-m{x)\X} = (X^W.X.)-iX^W.[n„(x)/2+0,(/iJ/4/ii/2^/iJ/^,/i)]. (A.38) 

By taking the boundary effect into consideration and the similar arguments as those 
in the proof of Theorem 4.1, we have 

1. 



n 



X^W.X,. = /(x)Cz/i,,.C 



+ 



Op{Vh) + oJ^ 



_ d 



where ui^^ and C are respectively defined in (4.6) and (4.8). The invertibility of 
^X!^W^X^ follows from the assumption (4.4) and (4.1). Indeed, from (4.4) and (4.1) 
we know 



/(a;)i^i,x,ii = f{x) / K{y)dy > 0, 

and hence Minkowski's inequality implies that with probability tending to 1, the 
invertibility holds. The binomial inverse theorem yields that 



1. 



-xjw.x. 



n 



+ 



0,(1) + oJ^^) Op(/i-i) + oJ^ 



(A.39) 



where 



yll ,,12 . 

,,12 \T ,,22 



('^l,a;,ll - '^l,x,12'^li,22^^l!x,12) \ 



'^S •= (^1.^.22 - yl,x^2^\,xMy\,x,i'^ ^ , and := -{i^iXn^hx, 12)^^1 



22 

X ■ 



The term iX^W^.n^ (a;) in (A.38) is evaluated by following the same lines as those 
in (A. 24) except for the boundary effect. By the same arguments as those used to 
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calculate the term qi in (A. 24), we have 

qi = / Kn{y.x){y-xf^{y-x)f{y)dV{y) + 0, 



cxp^S)(x) 



n 



-/2f^d/4- 



hf{x) 



[ K{\\u\\)u^Ressm{x)udu + 0,{h'/') + oJ } 



where the first equality comes from the CLT and the second equality comes from 
Lemma A. 2. 4 and the change of variable. Choose {e^}^^^ as an orthonormal basis of 
MP. By the same arguments as those in (A. 28), 
p ^ I. 

Kh{y,x){y - x){y - x)^{y - x,ek) 



q2 = B^Y^^rf^j ! 

k=i 



'exp^Z)(x) 



x/(y)dy(y) e, + Op 



1 



1 , d_3 



K{\\u\\)u^Y{e?,?,m{x)uu<lu + Op{h^) + O 



n2ni 2 



where the first equality comes from (A. 25) and the second one comes from the as- 
sumption hpca < h. Since m E and M is compact, the remainder term in (A. 38) 
is bounded by Oplkpith^^"^ + hpil.h). Thus, since hpca < ^ by assumption, it follows 
from (A.25) that 



E{m(x, h) — 'm{x)\X} 



(A.40) 



T,,-l 



1 '^\,x 



fr(||M||)M"^Hessm(x)M 



u 



du 



n2 ft4 



h- 



tr(Hessm(x)z/i^2:^22) 



Op{hlcJi-2 + hlcah) + Op 



The conditional variance is evaluated by the same lines as those in (A. 32): 
1, 



n 



(A.41) 



1 



p I „l/2ftd/4 



Op{h)^0, 



1 



Op{h) + C'p^_^i/2^i/4-i/2 j Op{h^/'^) + C'p^_^i/2^V4-l) 
which when combined with (A. 39) leads to 

(XjW,X,.)-'(X^W,©,W,X,)(XjW,Xj-^ (A.42) 
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n 



1 (j\x) 

hi fix) 



Op{l) + 0._ 



_ d 



Op(i) + oJ^-^ 



1 



± 11 



From (A. 42), since viC ^ = v'f , we have 



Var{m(x, 



f 1 1 \ 

"I" I ,d_i + 3,3d)- 

h 9. 9. n 9 n A ' 



nh^ fix) " ^nh'i 2 n^h't 

Putting this together with (A. 40) we obtain the conditional MSE of rh{x, h). 

With (A.39), (A.41) and the fact that vf+iC-^ = h-^/^ the conditional bias 
and the conditional variance of the estimator of the first order covariance derivative 
of m{x) are clear by the same calculation. For i = 1, . . . ,d, 



E{VaMx,h) -VaMx)\X} 



(A.43) 



K {\\u\\)u^}iessm{x)u 



'- u 



du 



+0p(/^J/t + O^) + 0,(^ 



+1 



and 



Nai{Vd,m{x,h)\X} 



fix) 



(A.44) 



h 2^ 2 n 9 h 



Then the conditional MSE of Va-m(x, h) follows from the above results. 



□ 



A.2.3 [Proof of Corollary 4.1] 

Proof. The proof is finished by simplifying the conditional bias term (A. 40) when the 
boundary dM is smooth. We should show that the conditional bias term is actually 
the linear combination of second order covariant derivatives of m at x. We first 
symmetrize the integration domain 2)(x) as follows. Suppose 

Xq = argmmdiy,x) 

y£dM 
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and 



h(x) = min d(y,x) < Vh. 



Choose a normal coordinate {(9j}f^^ on the geodesic ball B^{x) around x so that 
xs = exp^{h{x)dd{x)) . Divide D{x) into slices S*^ C M.'^~^, that is, 

where 

:={vGM<^-i:||(v,r/)||K. <V^}, 
and T] E [— V^, v^]- Define 5^ so that 

where Ri is the reflection of M'^ with respect to the i-th coordinate. The symmetriza- 
tion of D{x) is thus defined as 

Since dM is a smooth (d — l)-dimensional manifold, by Lemma A. 2. 2 we can ap- 
proximate exp~^(exp^ fl dM) by a homogeneous degree 2 polynomial defined on 
Texp-i(a;a) exp~^(exp^. fl dM), whose graph is symmetric in all coordinates, with 
error 0{h^/'^). Thus, the error of approximating by 5^ is of order 0{h^/'^) and 
hence the volume of the set D{x)^D{x) is 

Vo\{T){x)AT){x)^ =0{h'^'^+^). (A.45) 



We also denote 



a{x) := / K{\\u\\)du, (A.46) 



/3{x) := / K{\\u\\)uddu, (A.47) 
r(x) := diag(7i(x),...,7d(x)), (A.48) 



7i(x) := / ir(||^i||KdM, ^ = l,...,c?. (A.49) 



Thus, since 2)(x) is symmetric in the first d — 1 directions, by (A.45) we have 
K{\\u\\)du= [ K{\\u\\)du + Oih) = aix) + 0{h), 
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K{\\u\\)u^du 



K{\\u\\)u^du + Oih) = (3v^{x) + Oih), 



and 



/ K{\\u\\)uu^du 



K{\\u\\)uu^du + 0{h) = T{x) + 0{h). 



Hence, we get the following equations: 



+ 0{h), 



^'"^ a{x) - P{xY-td{x 
ull = V{xr^ + 0{h). 

Similarly, by the symmetry of ©(x), we have 



( 1 1 M 1 1 ) w'^Hessm (x) M 



1 

u 



du 



4 

Vh 



K {\\u\\)u^}iessm{x)u 



1 

u 



Plugging (A.50), (A.51), (A.52), and (A.53) into (A.40) leads to 

tr(Hessm(a;)t/i,^,22) ELi lk{x)jd{x)Vl^ Q^m{x) 



(A.50) 

(A.51) 
(A.52) 



du + 0{h). (A.53) 



(A.54) 



2(z/i,x,ii - '^i,x,i2i^ii,22i^i,x,2i) 2[a(x)7d(x) - /3(x)2] 

which finishes the claim. Moreover, by the Cauchy-Schwartz inequality, a{x)'jd{x) — 
f3{x)^ > for all x G M^. Since M is compact, the uniform boundedness of 

a{x)-ya{x)-l3{xy 



holds as is claimed. 



□ 
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